<>:4: SyntaxWarning:
invalid escape sequence '\l'
<>:4: SyntaxWarning:
invalid escape sequence '\l'
C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\2876480636.py:4: SyntaxWarning:
invalid escape sequence '\l'
C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\2876480636.py:4: DtypeWarning:
Columns (19,30) have mixed types. Specify dtype option on import or set low_memory=False.
C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\2876480636.py:14: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
0.1 Dropping Unncessary Columns
Answer the Questions + Which columns are irrelevant or redundant? + Why are we removing multiple versions of NAICS/SOC codes? + How will this improve analysis?
Answer the question: How should missing values be handled?
import missingno as msnoimport matplotlib.pyplot as plt# Visualize missing datamsno.heatmap(data)plt.title("Missing Values Heatmap")plt.show()# Fill missing valuesdata["SALARY"].fillna(data["SALARY"].median(), inplace=True)data["NAICS_2022_6_NAME"].fillna("Unknown", inplace=True)
C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\1556623005.py:10: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\1556623005.py:11: FutureWarning:
A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
0.3 Remove Duplicates
data = data.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")
1 Exploratory Data Analysis
1.1 Job Postings by Industry
import plotly.express as pxfig = px.bar(data["NAICS_2022_6_NAME"].value_counts(), title="Job Postings by Industry")fig.show()
Job Posting Anaylsis
1.2 Salary Distribution by Industry
fig = px.box(data, x="NAICS_2022_6_NAME", y="SALARY", title="Salary Distribution by Industry")fig.show()
Job Posting Analysis
1.3 Remote vs. On-Site Jobs
fig = px.pie(data, names="REMOTE_TYPE_NAME", title="Remote vs. On-Site Jobs")fig.show()